Empirical Tests of Zipf's Law Mechanism in Open Source Linux Distribution

The evolution of open source software projects in Linux distributions offers a remarkable example
of a growing complex self-organizing adaptive system, exhibiting Zipf's law over four full decades.
We present three tests of the usually assumed ingredients of stochastic growth models that have been
previously conjectured to be at the origin of Zipf's law: (i) the growth observed between successive
releases of the number of in-directed links of packages obeys Gibrat's law of proportional growth;
(ii) the average growth increment of the number of in-directed links of packages over a time interval
$\Delta t$ is proportional to $\Delta t$, while its standard deviation is proportional to
$\sqrt{\Delta t}$; (iii) the distribution of the number of in-directed links of new packages
appearing in evolving versions of Debian Linux distributions has a tail thinner than Zipf's law,
with an exponent which converges to the Zipf's law value 1 as the time $\Delta t$ between releases
increases.

PACS numbers: 89.75.Ak; 89.75.Da; 02.50.Ey 



Complex adaptive systems in nature and society often exhibit
scale-free properties, in their self-organizing fractal geometry
and/or in their self-similar statistical distributions. Among the
many such characteristics, Zipf's law plays a particular role as
one of the few quantitative reproducible regularities found in the
social sciences. Zipf's law usually refers to probability density
functions $p(x)$ of some stochastic variable $x$, usually a size or
frequency, exhibiting the power law dependence

$p(x) \sim 1/x^{1+\mu}$ , with $\mu = 1$ . (1)
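As a short worked step, the density form (1) can be connected to the rank-size form in which Zipf's law was first stated, and to the cumulative counts used for the data below. The sketch assumes a continuous variable, $\mu > 0$, and a sample of $N$ entities; $N$ and the rank notation are introduced here only for illustration.

```latex
\begin{align}
p(x) &\sim \frac{1}{x^{1+\mu}} , \\
P(X > x) &= \int_x^{\infty} p(x')\,dx' \sim \frac{1}{x^{\mu}} , \\
\mathrm{rank}(x) &\approx N\,P(X > x) \sim \frac{N}{x^{\mu}}
\quad\Longrightarrow\quad
x_{\mathrm{rank}} \sim \mathrm{rank}^{-1/\mu} .
\end{align}
```

For $\mu = 1$ the size of an entity is inversely proportional to its rank, and the survival function appears as a straight line of slope $-1$ in a log-log plot.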



Initially formulated as a rank-frequency relationship quantifying
the relative commonness of words in natural languages [1], Zipf's
law accounts remarkably well for the distribution of city sizes [2]
as well as firm sizes [3, 4, 5] all over the world. Recently, Zipf's
law has also been found in Web access statistics and Internet
traffic characteristics [6, 13] as well as in bibliometrics,
informetrics, scientometrics, and library science (see [7] and
references therein).

Starting with Yule [8] and Schumpeter [9], it is now recognized
that there are important links between such size distributions and
growth. On this basis, Simon [10] articulated a simple mechanism
for Zipf's law based on Gibrat's law of proportionate effect [11]
implemented in a stochastic growth model with new entrants. A
modern formulation of Gibrat's law is that growth is a random
process, with successive stochastic realizations of the growth
rates that are independent of the size of the entity (city, firm,
website popularity and so on). In the context of the distribution
of firm sizes, Simon [10] modified Gibrat's model by accounting for
the entry of new firms over time as the overall industry grows.
This model has recently been rediscovered under the name
"preferential attachment" to explain the scale-free networks found
in social communities, the world-wide web, or networks of proteins
reacting with each other in biological cells [12, 13]. But the
existence of new entrants in the growth process is just one of the
many different additional ingredients complementing Gibrat's law
that yield Zipf's law.
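The mechanism of proportional growth with new entrants can be illustrated with a minimal simulation. This is a sketch of a textbook Simon-type scheme, not the model fitted in this letter: the function name `simon_model`, the entry probability `alpha`, and the number of steps are all illustrative assumptions. In this scheme the tail exponent is known to approach the Zipf value as `alpha` becomes small ($\mu = 1/(1-\alpha)$ in the notation of Eq. (1)).

```python
import random
from collections import Counter

def simon_model(steps, alpha, seed=0):
    """Grow a population of entities one unit at a time.

    With probability alpha the new unit founds a new entity of size 1;
    otherwise it joins an existing entity chosen with probability
    proportional to that entity's current size (proportional growth).
    Returns the entity sizes sorted from largest to smallest.
    """
    rng = random.Random(seed)
    owners = [0]        # owners[k] is the entity that holds unit k
    n_entities = 1
    for _ in range(steps):
        if rng.random() < alpha:
            owners.append(n_entities)   # a new entrant appears
            n_entities += 1
        else:
            # A uniformly chosen unit belongs to a given entity with
            # probability proportional to that entity's size.
            owners.append(owners[rng.randrange(len(owners))])
    return sorted(Counter(owners).values(), reverse=True)

# Illustrative parameters only.
sizes = simon_model(steps=50_000, alpha=0.05)
print(len(sizes), sizes[:5])
```

Plotting rank against size for `sizes` on log-log axes gives an approximately straight line, which is the rank-size signature of a power-law tail.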

While several works have tested Gibrat's law directly (and its
deviations) in various contexts, and have conjectured on its
relevance to explain Zipf's law, we present the first fully
consistent empirical study showing that the usually assumed
ingredients of stochastic growth models are indeed present in a
system exhibiting Zipf's law. For this, we provide an empirical
analysis of the growth of an operating system (Debian Linux) based
on open source software. Large Linux distributions typically
contain tens of thousands of connected packages, including the
operating system and applications, which form a complex web of
inter-dependencies. A measure of the "centrality" of a given
package is the number of other packages that call it in their
routine, a measure we refer to as the number of in-directed links
or connections that other packages have to a given package. We find
that the distribution of in-directed links of packages in
successive Debian Linux distributions precisely obeys Zipf's law
over four orders of magnitude. We then verify explicitly that the
growth observed between successive releases of the number of
in-directed links of packages obeys Gibrat's law to a good
approximation. As an additional critical test of the stochastic
growth process, we confirm empirically that the average growth
increment of the number of in-directed links of packages over a
time interval $\Delta t$ is proportional to $\Delta t$, while its
standard deviation is proportional to $\sqrt{\Delta t}$, as
predicted from Gibrat's law implemented in a standard stochastic
growth model. In addition, we verify that the distribution of the
number of in-directed links of new packages appearing in evolving
versions of Debian Linux distributions has a tail thinner than
Zipf's law, confirming that Zipf's law in this system is controlled
by the growth process.

FIG. 1: (Color online) Log-log plot of the number of packages in
four Debian Linux distributions with more than C in-directed links.
The four Debian Linux distributions are Woody (19.07.2002) (orange
diamonds), Sarge (06.06.2005) (green crosses), Etch (15.08.2007)
(blue circles), and Lenny (15.12.2007) (black +'s). The inset shows
the maximum likelihood estimate (MLE) of the exponent $\mu$ together
with two boundaries defining its 95% confidence interval
(approximately given by $1 \pm 2/\sqrt{n}$, where n is the number of
data points used in the MLE), as a function of the lower threshold.
The MLE has been modified from the standard Hill estimator to take
into account the discreteness of C.
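The two quantities used throughout, the number C of in-directed links of a package and the number of packages with more than C in-directed links, can be computed as in the following sketch. The dependency map `toy` and both function names are hypothetical; the actual extraction of dependencies from Debian releases follows the methodology of Ref. [22] and is not reproduced here.

```python
from collections import Counter

def in_link_counts(depends):
    """Map each package to the number of distinct other packages that depend on it."""
    counts = Counter({pkg: 0 for pkg in depends})
    for pkg, deps in depends.items():
        for dep in set(deps):          # count each dependent package once
            if dep != pkg:             # ignore self-references
                counts[dep] += 1
    return dict(counts)

def packages_with_more_than(counts, thresholds):
    """Unnormalized survival function: number of packages with more than C in-links."""
    values = list(counts.values())
    return {c: sum(1 for v in values if v > c) for c in thresholds}

# Hypothetical toy dependency map: package -> packages it depends on.
toy = {
    "libc": [],
    "zlib": ["libc"],
    "openssl": ["libc", "zlib"],
    "curl": ["libc", "zlib", "openssl"],
    "git": ["libc", "zlib", "curl"],
}
counts = in_link_counts(toy)
survival = packages_with_more_than(counts, [0, 1, 2, 3])
print(counts)
print(survival)
```

On real data, plotting `survival` on log-log axes gives the kind of curve described in the caption of Fig. 1.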

The Linux kernel was created in 1991 by Linus Torvalds as a clone
of the proprietary Unix operating system [17, 18], and was licensed
under the GNU General Public License. Its code and open source
license immediately had a strong appeal to other developers, who
contributed to its further development. Quickly, the community of
open source developers started to run other open source programs on
this new operating system. In 1993, Debian Linux [19] became the
first successful non-commercial general distribution of an open
source operating system. While continuously evolving, it remains up
to the present the "mother" of a dominant Linux branch, competing
with a growing number of derived distributions (Ubuntu, Dreamlinux,
Damn Small Linux, Knoppix, Kanotix, and so on).

From a few tens to hundreds of packages (474 in 1996 (v1.1)),
Debian has expanded to include more than 18,000 packages in 2007,
with many intricate dependencies between them that can be
represented by complex functional networks. Debian offers a
remarkable example of a growing complex self-organizing adaptive
system [20]. Its evolution is recorded by a chronological series of
stable and unstable releases: new packages enter, some disappear,
others gain or lose connectivity. Here, we study the following
sequence of Debian releases: Woody: 19.07.2002; Sarge: 06.06.2005;
Etch: 15.08.2007; Lenny (unstable version): 15.12.2007; several
other Lenny versions from 18.03.2008 to 05.05.2008 in intervals of
7 days.

Figure 1 shows the number of packages in the first four successive
versions of Debian Linux with more than C in-directed links, which
is the unnormalized complementary cumulative (or survival)
distribution of the packages' numbers of in-directed links. Zipf's
law is confirmed over four full decades, for each of the four
releases. Notwithstanding the large modifications between releases
and the multiplication of the number of packages by a factor of
three between Woody and Lenny, the distributions shown in Fig. 1
are all consistent with Zipf's law. It is remarkable that no
noticeable cut-off or change of regime occurs at either the left or
the right end of the distributions shown in Fig. 1. Our results
extend those conjectured in Ref. [21] for Red Hat Linux. By using
Debian Linux, which is better suited for the sampling of projects
than the often used SourceForge collaboration platform, we avoid
biases and gather unique information only available in an
integrated environment [22].
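The exponent estimate and confidence band reported in the inset of Fig. 1 can be illustrated with the standard Hill estimator. This sketch uses the continuous-variable form, whereas the estimator used for Fig. 1 was modified for discrete C in a way not detailed here; the synthetic Pareto sample, the threshold, and the function name `hill_estimate` are assumptions made purely for illustration.

```python
import math
import random

def hill_estimate(values, threshold):
    """Continuous-data MLE (Hill estimator) of mu for a tail P(X > x) ~ x**(-mu).

    Only values at or above `threshold` are used. Returns the estimate,
    an approximate 95% interval mu * (1 -/+ 2/sqrt(n)), and the tail size n.
    """
    tail = [v for v in values if v >= threshold]
    n = len(tail)
    mu = n / sum(math.log(v / threshold) for v in tail)
    half_width = 2.0 * mu / math.sqrt(n)
    return mu, (mu - half_width, mu + half_width), n

# Synthetic sample with P(X > x) = 1/x for x >= 1, i.e. mu = 1 (not Debian data).
rng = random.Random(1)
sample = [1.0 / (1.0 - rng.random()) for _ in range(20_000)]

mu_hat, ci, n_tail = hill_estimate(sample, threshold=10.0)
print(mu_hat, ci, n_tail)
```

Repeating the call for a range of `threshold` values gives the kind of estimate-versus-lower-threshold curve shown in the inset.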

To understand the origin of this Zipf's law, we use the general
framework of stochastic growth models, and we track the time
evolution of a given package via its number C of in-directed links
connecting it to other packages within Debian Linux. The increment
dC of the number of in-directed links to a given package over a
small time interval dt is assumed to be the sum of two
contributions, defining a generalized diffusion process:

$dC = r(C)\,dt + \sigma(C)\,dW$ , (2)



where $r(C)$ is the average deterministic growth of the in-directed
link number, $\sigma(C)$ is the standard deviation of the stochastic
component of the growth process, and $dW$ is the increment of the
Wiener process (with $\langle dW \rangle = 0$ and
$\langle dW^2 \rangle = dt$, where the brackets denote the
statistical average). Zipf's law has been shown to arise under a
variety of conditions associated with Gibrat's law. The simplest
implementation of Gibrat's law states that both $r(C)$ and
$\sigma(C)$ are proportional to $C$:

$r(C) = r \times C$ , $\sigma(C) = \sigma \times C$ , (3)

with proportionality coefficients $r$ and $\sigma$ obeying the
inequality $r < \sigma$. This latter inequality expresses that the
proportional growth is dominated by its stochastic component [23].
Accordingly, the heavy tail structure of Zipf's law can be thought
of as the result of large stochastic multiplicative excursions. The
rest of the letter is devoted to testing and validating this model.
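To make the model (2) with the proportional coefficients (3) concrete, the following sketch integrates it numerically with an Euler-Maruyama scheme and measures the mean and standard deviation of the relative growth over two horizons. The parameter values (`r = 0.05`, `sigma = 0.3`, the starting value `c0`, the step and path counts) are illustrative choices satisfying $r < \sigma$, not values fitted to Debian data.

```python
import math
import random
import statistics

def simulate_growth_rates(r, sigma, horizon, n_paths=5000, n_steps=20, c0=100.0, seed=0):
    """Integrate dC = r*C*dt + sigma*C*dW with Euler-Maruyama steps.

    Returns the relative growth (C_end - c0) / c0 of each simulated path.
    """
    rng = random.Random(seed)
    dt = horizon / n_steps
    sqrt_dt = math.sqrt(dt)
    rates = []
    for _ in range(n_paths):
        c = c0
        for _ in range(n_steps):
            dw = rng.gauss(0.0, sqrt_dt)      # Wiener increment: mean 0, variance dt
            c += r * c * dt + sigma * c * dw
        rates.append((c - c0) / c0)
    return rates

r, sigma = 0.05, 0.3      # illustrative values with r < sigma
results = {}
for horizon in (0.25, 1.0):
    rates = simulate_growth_rates(r, sigma, horizon)
    results[horizon] = (statistics.fmean(rates), statistics.stdev(rates))
    print(horizon, results[horizon])
```

Quadrupling the horizon should roughly double the standard deviation, while the mean grows roughly in proportion to the horizon. Both relations hold only approximately, and deviations grow with the horizon because of compounding.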

First, we measure the time evolution of the in-directed links of
all packages in the successive Debian releases by retrieving the
network of dependencies following the methodology explained in Ref.
[22]. For packages which are common to successive releases, we find
that their connectivity, measured for instance by their number C of
in-directed links, increases on average, albeit with considerable
fluctuations. Consider for instance the update from Etch
(15.08.2007) to the latest Lenny version (05.05.2008). For each
package i which is common to these two versions, we measure the
increment $\Delta C_i$ of the number $C_i$ of in-directed links to
that package from Etch to the latest Lenny version. The left panel
of Fig. 2 plots these increments $\Delta C_i$ as a function of
$C_i$. This figure is typical of the results obtained on the
increments $\Delta C_i$ between other pairs of Debian releases. The
scatter plot confirms the existence of an approximate
proportionality between $\Delta C_i$ and $C_i$, especially for the
largest $C_i$ values, in agreement with the first equation of (3).
The right panel of Fig. 2 shows the standard deviation of
$\Delta C$ as a function of C, confirming the second equation of
(3). These two panels are direct evidence of Gibrat's law for
package connectivities, which constitutes an essential ingredient
of stochastic growth models of Zipf's law [13]. Notice that the
large scatter decorating the approximate proportionality between
$\Delta C_i$ and $C_i$ observed in Fig. 2 and quantified in the
right panel of Fig. 2 is an essential ingredient for Zipf's law to
appear [23].

FIG. 2: Left panel: Plots of $\Delta C$ versus C from the Etch
release (15.08.2007) to the latest Lenny version (05.05.2008) in
double logarithmic scale. Only positive values are displayed. The
linear regression $\Delta C = R \times C + C_0$ is significant at
the 95% confidence level, with a small value $C_0 = 0.3$ at the
origin and $R = 0.09$. Right panel: same as left panel for the
standard deviation of $\Delta C$.

We then combine (2) and (3) to predict that, over a not too large
time interval $\Delta t$, (i) the average growth rate
$R(\Delta t) = \langle \Delta C/C \rangle$ should be given by

$R(\Delta t) = r \times \Delta t$ , (4)

and (ii) the standard deviation of the growth rate

$\Sigma(\Delta t) = \langle [\Delta C/C]^2 \rangle^{1/2}$ (5)

should be equal to

$\Sigma(\Delta t) = \sigma \times \sqrt{\Delta t}$ . (6)

FIG. 3: Dependence of $R(\Delta t)$ and $\Sigma(\Delta t)$, defined
respectively by $R(\Delta t) = \langle \Delta C/C \rangle$ and (5),
as a function of their time interval $\Delta t$ for the 66 time
intervals that can be formed between all the Debian releases in our
database (which includes the four major Debian releases from
19.07.2002 to 15.12.2007 as well as the several Lenny releases from
18.03.2008 to 05.05.2008 in intervals of 7 days). The error bars
show the 95% confidence intervals, obtained by shuffling 1000 times
the linear regression residuals. The straight lines represent the
best linear fits. The existence of a genuine linear dependence of
$R$ as a function of $\Delta t$ cannot be rejected (p < 0.05) and
has a high significance level (square of correlation coefficient
$\mathcal{R}^2 = 0.93$). The regression of $\Sigma$ versus
$\sqrt{\Delta t}$ enjoys the same high statistical confidence
(p < 0.05 and $\mathcal{R}^2 = 0.97$).



This last result derives from the properties of the Wiener process
increments $dW$. We test these two predictions (4) and (6) as
follows. Out of the four major Debian releases from 19.07.2002 to
15.12.2007 as well as the several Lenny releases from 18.03.2008 to
05.05.2008 in intervals of 7 days, 66 different time intervals can
be formed. For each time interval, we calculate the average growth
rate defined by $R(\Delta t) = \langle \Delta C/C \rangle$ and its
standard deviation defined by (5). Technically, we estimate
$R(\Delta t)$ (respectively $\Sigma(\Delta t)$) as the slope
(respectively the standard deviation of the residuals) of the
linear regression of $\Delta C$ as a function of C. This method
allows us to construct confidence bounds by bootstrapping (we
reshuffle 1000 times the linear regression residuals). The left
(resp. right) panel of Fig. 3 shows the 66 values of $R(\Delta t)$
(resp. $\Sigma(\Delta t)$) as a function of their corresponding
time interval $\Delta t$ (resp. square root of $\Delta t$),
providing a strong validation of the stochastic growth model (2)
and (3).
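The estimation step described above (slope of the regression of $\Delta C$ on C, standard deviation of the residuals, and confidence bounds from 1000 reshufflings of the residuals) can be sketched as follows. This is one plausible reading of the procedure, not the code used for Fig. 3: the function names are hypothetical, and the data are synthetic, generated with slope 0.09 and intercept 0.3 and, for simplicity, a constant noise level.

```python
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares fit y = slope*x + intercept; also returns the residuals."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    return slope, intercept, residuals

def residual_bootstrap(x, y, n_boot=1000, seed=0):
    """Slope estimate with a 95% interval obtained by reshuffling the residuals."""
    rng = random.Random(seed)
    slope, intercept, residuals = fit_line(x, y)
    slopes = []
    for _ in range(n_boot):
        shuffled = residuals[:]
        rng.shuffle(shuffled)               # reassign residuals to the fitted values
        y_star = [slope * xi + intercept + e for xi, e in zip(x, shuffled)]
        slopes.append(fit_line(x, y_star)[0])
    slopes.sort()
    lower = slopes[int(0.025 * n_boot)]
    upper = slopes[int(0.975 * n_boot) - 1]
    return slope, (lower, upper)

# Synthetic (C, Delta C) pairs for one pair of releases; not Debian data.
rng = random.Random(42)
C = [float(rng.randint(1, 1000)) for _ in range(300)]
dC = [0.09 * c + 0.3 + rng.gauss(0.0, 5.0) for c in C]

R_hat, (R_low, R_high) = residual_bootstrap(C, dC)
Sigma_hat = statistics.stdev(fit_line(C, dC)[2])
print(R_hat, (R_low, R_high), Sigma_hat)
```

Applying the same two estimates to each pair of releases and plotting them against $\Delta t$ and $\sqrt{\Delta t}$ gives the kind of comparison shown in Fig. 3.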

We now address the question of how the increase of the number of
packages interacts with the growth process of the number of links
between packages. This issue has been considered in the context of
firms (see Ref. [23] for a detailed presentation and summary of the
literature), and applies to the case of packages as follows. Most
stochastic growth models of firms based on Gibrat's principle
attempt to derive the distribution of the cross-section of firm
sizes directly from the distribution of the asset value of a single
firm as a function of time. Indeed, many models start with the
implicit or explicit assumption that the set of firms was born at
the same origin of time. This approach is mathematically equivalent
to considering that the universe is made only of one single entity.
Therefore, the distribution of firm sizes can reach a steady state
if and only if the distribution of the asset value of a single firm
reaches a steady state, which is counterfactual. A more realistic
model takes into account the fact that firms do not all appear at
the same time but are born according to a more or less regular flow
of newly created firms. Competing with the birth process, firms
also disappear at a surprisingly high rate. Similarly, the
evolution of successive Debian releases is punctuated by additions
and deletions of many packages. For instance, with the latest
stable release (Lenny, 15.12.2007), 885 packages disappeared,
partly merged, or were renamed, while 2983 packages appeared
compared to the preceding release. Clearly, the dynamics of the
connectivity between packages depends on the birth as well as the
demise of packages. Therefore, the stochastic growth model (2) must
be supplemented by a model of the birth and death of packages. For
this, we use the approach of Saichev et al. [23], who showed that
Gibrat's law of proportionate growth does not need to be strictly
satisfied in the presence of the birth and death of entities
following the stochastic growth process (2): as long as the
volatility $\Sigma(\Delta t)$ defined by (5) increases
asymptotically proportionally to C and the instantaneous growth
rate increases no faster than the volatility, the distribution of
sizes follows Zipf's law. This suggests that the occurrence of very
large firms in the distribution of sizes described by Zipf's law is
more a consequence of random growth than of systematic returns.
Likewise, in particular for packages with large connectivities,
volatility can dominate over the instantaneous growth rate.

Figure 4 verifies that the distribution of the numbers C of
in-directed links of newly born packages has a tail thinner than
Zipf's law, and converges progressively to Zipf's law as the time
elapsed between two releases increases, reflecting the increasing
impact of the stochastic multiplicative growth process. This
confirms that Zipf's law indeed results from the stochastic
multiplicative growth process at the level of individual packages
in the presence of the birth and death of packages.

FIG. 4: The right panel shows that the distribution of C's of new
packages appearing between successive unstable Lenny releases
separated by one week is a power law with exponent
$\mu \approx 1.5$; the left panel shows that the same power law has
a smaller exponent, closer to 1, as one considers the new packages
appearing between two more distant releases. We have verified that
this effect is systematic in our database. The exponents $\mu$ are
obtained by maximum likelihood, adapted to the discreteness of C
values. The thin lines define the 95% confidence intervals.